Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (#6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
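As an illustration of the No-gRPC pattern the paper benchmarks, the sketch
below shows a minimal Horovod-over-MPI setup for ResNet-50 in TensorFlow. It
assumes Horovod built against a CUDA-aware MPI and a standard mpirun launch;
it is a sketch of the usage pattern, not the paper's benchmark code.

    # Minimal sketch: Horovod + (CUDA-aware) MPI gradient aggregation.
    # Launch with, e.g.:  mpirun -np 64 python train.py
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one MPI rank per GPU

    # Pin each rank to its local GPU so Allreduce can work on device buffers.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.applications.ResNet50(weights=None)
    # DistributedOptimizer wraps each gradient step in an MPI Allreduce,
    # which is the operation the paper's CUDA-aware design accelerates.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.1 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)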
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training
A new neural network architecture called Mixture-of-Experts (MoE) has been
proposed recently that increases the parameters of a neural network (the base
model) by adding sparsely activated expert blocks, without changing the total
number of floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to
128 expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high quality MoE models requires
us to scale the size of the base models, along with the number of expert
blocks. In this work, we propose a novel, three-dimensional, hybrid parallel
algorithm that combines tensor, expert, and data parallelism to enable the
training of MoE models with 4-8x larger base models than the current
state-of-the-art -- DeepSpeed-MoE. We propose memory optimizations in the
optimizer step, and communication optimizations that eliminate redundant
movement of data. Removing these redundancies provides a speedup of nearly 21%.
When training a 40 billion parameter MoE model (6.7 billion base model with 16
experts) on 128 V100 GPUs, our optimizations raise the achieved fraction of
peak half-precision flop/s from 20% to 27%.
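A quick back-of-the-envelope check of the quoted model size: an MoE model
replicates only the expert weights, so total size follows from the base size,
the expert count, and the fraction of base parameters living in expert blocks.
The one-third fraction below is an assumption for illustration (e.g., MoE
layers on alternating FFNs), not a figure from the paper.

    # total = shared params + num_experts * replicated (expert) params
    def moe_params(base, num_experts, expert_fraction):
        shared = base * (1 - expert_fraction)
        replicated = num_experts * base * expert_fraction
        return shared + replicated

    # 6.7B base, 16 experts, ~1/3 of base params in expert blocks:
    print(moe_params(6.7e9, 16, 1/3) / 1e9)  # ~40.2, near the quoted 40B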
Intercloud Message Exchange Middleware
Cloud Interoperability has been a core issue pertaining to the Intercloud and
Cloud Federation. Several vendor-based proprietary solutions and open-source
middleware exist to address it; however, these solutions are tightly coupled
to particular cloud environments. For heterogeneous clouds to exist in an
interoperable environment, a vendor-independent, secure, and reliable message
exchange middleware is critical. In this paper, considering a general cloud
architecture, we present a Publish-Subscribe based middleware for Intercloud
Message Exchange, implemented on top of the Data Distribution Service (DDS).
DDS's reliable pub-sub messaging, in conjunction with our devised Information
Model, can be a novel candidate for the messaging domain of Intercloud
Interoperability Standards. This Information Model also hosts an OWL-based
Cloud Resource Description Ontology, used by cloud environments for resource
cataloguing and possible matchmaking prior to workload migration between
heterogeneous clouds.
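To make the pub-sub pattern concrete, here is a minimal in-process sketch; a
real DDS deployment adds typed topics, QoS policies, and peer discovery over
the network, and the topic name and message fields below are hypothetical.

    # Minimal in-process publish-subscribe broker (illustration only;
    # DDS provides this over the network with QoS and discovery).
    from collections import defaultdict

    class Broker:
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self._subscribers[topic].append(callback)

        def publish(self, topic, message):
            for callback in self._subscribers[topic]:
                callback(message)

    broker = Broker()
    # Cloud B listens for resource descriptions published for matchmaking.
    broker.subscribe("intercloud/resource_catalog",
                     lambda msg: print("cloud B sees:", msg))
    # Cloud A advertises a (hypothetical) resource description.
    broker.publish("intercloud/resource_catalog",
                   {"cloud": "A", "vcpus": 8, "memory_gb": 32})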
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial
intelligence, from summarization and coding to translation, matching or even
surpassing human performance. However, the current landscape lacks an
accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement
Learning with Human Feedback) training pipeline for these powerful models,
particularly when training at the scale of billions of parameters. This paper
introduces DeepSpeed-Chat, a novel system that democratizes RLHF training,
making it accessible to the AI community. DeepSpeed-Chat offers three key
capabilities: an easy-to-use training and inference experience for ChatGPT-like
models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from
InstructGPT, and a robust DeepSpeed-RLHF system that combines various
optimizations for training and inference in a unified way. The system delivers
unparalleled efficiency and scalability, enabling training of models with
hundreds of billions of parameters in record time and at a fraction of the
cost. With this development, DeepSpeed-Chat paves the way for broader access to
advanced RLHF training, even for data scientists with limited resources,
thereby fostering innovation and further development in the field of AI.
Comment: 14 pages, 7 figures
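The InstructGPT-style pipeline that DeepSpeed-Chat replicates has three
stages; the outline below is a schematic of those stages, not the
DeepSpeed-Chat API, and the function bodies are deliberately left as stubs.

    # Schematic of the three RLHF stages (stubs, not DeepSpeed-Chat's API).
    def supervised_finetune(base_model, demonstrations):
        """Step 1: fine-tune the base LM on human demonstrations (SFT)."""
        ...

    def train_reward_model(base_model, preference_pairs):
        """Step 2: fit a reward model on human preference rankings."""
        ...

    def ppo_finetune(sft_model, reward_model, prompts):
        """Step 3: optimize the SFT model with PPO against the reward model."""
        ...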
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Mixture-of-Experts (MoE) is a neural network architecture that
adds sparsely activated expert blocks to a base model, increasing
the number of parameters without impacting computational costs.
However, current distributed deep learning frameworks are limited
in their ability to train high-quality MoE models with large base
models. In this work, we present DeepSpeed-TED, a novel, three-dimensional,
hybrid parallel algorithm that combines data, tensor,
and expert parallelism to enable the training of MoE models with
4–8× larger base models than the current state-of-the-art. We also
describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement.
We implement our approach in DeepSpeed and achieve speedups of
26% over a baseline (i.e. without our communication optimizations)
when training a 40 billion parameter MoE model (6.7 billion base
model with 16 experts) on 128 V100 GPUs.
https://doi.org/10.1145/3577193.359370
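One way to picture the three-dimensional decomposition is as a grid of ranks
sliced three ways. The layout below is illustrative, not DeepSpeed-TED's
actual grouping; each returned list of ranks would typically become one
communicator group (e.g., via torch.distributed.new_group).

    # Illustrative rank layout for a data x expert x tensor grid.
    import numpy as np

    def rank_grid(world_size, tensor_par, expert_par):
        data_par = world_size // (tensor_par * expert_par)
        grid = np.arange(world_size).reshape(data_par, expert_par, tensor_par)
        tensor_groups = grid.reshape(-1, tensor_par).tolist()
        expert_groups = np.swapaxes(grid, 1, 2).reshape(-1, expert_par).tolist()
        data_groups = grid.reshape(data_par, -1).T.tolist()
        return tensor_groups, expert_groups, data_groups

    # 8 GPUs with tensor_par=2 and expert_par=2 leave data_par=2:
    t, e, d = rank_grid(8, 2, 2)
    print(t)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(e)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
    print(d)  # [[0, 4], [1, 5], [2, 6], [3, 7]]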
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the boundary on the availability
and capability of the hardware resources today, Mixture-of-Experts (MoE) models
become one of the most promising model architectures due to their significant
training cost reduction compared to a quality-equivalent dense model. Its
training cost saving is demonstrated from encoder-decoder models (prior works)
to a 5x saving for auto-regressive language models (this work along with
parallel explorations). However, due to the much larger model size and unique
architecture, how to provide fast MoE model inference remains challenging and
unsolved, limiting its practical usage. To tackle this, we present
DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the
DeepSpeed library, including novel MoE architecture designs and model
compression techniques that reduce MoE model size by up to 3.7x, and a highly
optimized inference system that provides 7.3x better latency and cost compared
to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented
scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x
cheaper inference compared to quality-equivalent dense models. We hope our
innovations and systems help open a promising path to new directions in the
large model landscape, a shift from dense to sparse MoE models, where training
and deploying higher-quality models with fewer resources becomes more widely
possible.
Comment: This paper is published at ICML 2022:
https://proceedings.mlr.press/v162/rajbhandari22
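The "sparsely activated" property central to these results comes from the MoE
routing step; the numpy sketch below shows top-1 gating, where each token
activates exactly one expert so compute stays flat as experts are added
(dimensions and weights are illustrative).

    # Top-1 expert routing: each token is processed by a single expert.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens, d_model, n_experts = 4, 8, 16
    x = rng.standard_normal((tokens, d_model))
    W_gate = rng.standard_normal((d_model, n_experts))
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

    logits = x @ W_gate                 # router scores per expert
    choice = logits.argmax(axis=1)      # top-1 expert for each token
    y = np.stack([x[i] @ experts[e] for i, e in enumerate(choice)])
    print(choice, y.shape)              # one expert touched per token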
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
The past several years have witnessed the success of transformer-based
models, and their scale and application scenarios continue to grow
aggressively. The current landscape of transformer models is increasingly
diverse: the model size varies drastically with the largest being of
hundred-billion parameters; the model characteristics differ due to the
sparsity introduced by the Mixture-of-Experts; the target application scenarios
can be latency-critical or throughput-oriented; the deployment hardware could
be single- or multi-GPU systems with different types of memory and storage,
etc. With such increasing diversity and the fast-evolving pace of transformer
models, designing a highly performant and efficient inference system is
extremely challenging. In this paper, we present DeepSpeed Inference, a
comprehensive system solution for transformer model inference to address the
above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU
inference solution to minimize latency while maximizing the throughput of both
dense and sparse transformer models when they fit in aggregate GPU memory, and
(2) a heterogeneous inference solution that leverages CPU and NVMe memory in
addition to the GPU memory and compute to enable high inference throughput with
large models which do not fit in aggregate GPU memory. DeepSpeed Inference
reduces latency by up to 7.3X over the state-of-the-art for latency-oriented
scenarios and increases throughput by over 1.5x for throughput-oriented
scenarios. Moreover, it enables trillion parameter scale inference under
real-time latency constraints by leveraging hundreds of GPUs, an unprecedented
scale for inference. It can inference 25x larger models than with GPU-only
solutions, while delivering a high throughput of 84 TFLOPS (over of
A6000 peak)
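For reference, the multi-GPU (tensor-parallel) path is exposed through
DeepSpeed's init_inference entry point; the sketch below follows the pattern
in DeepSpeed's inference tutorials, with the model choice and mp_size as
illustrative assumptions rather than settings from the paper.

    # Sketch of latency-oriented multi-GPU inference with DeepSpeed.
    # Launch with, e.g.:  deepspeed --num_gpus 4 infer.py
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    engine = deepspeed.init_inference(
        model,
        mp_size=4,                        # shard layers across 4 GPUs
        dtype=torch.half,                 # fp16 weights and compute
        replace_with_kernel_inject=True,  # use DeepSpeed's fused kernels
    )
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    inputs = tokenizer("DeepSpeed Inference at scale",
                       return_tensors="pt").to("cuda")
    outputs = engine(**inputs)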